Non-trivial two-armed partial-monitoring games are bandits 






András Antos
Computer and Automation Research Institute, Hungarian Academy of Sciences

Gábor Bartók and Csaba Szepesvári
Department of Computing Science, University of Alberta, Edmonton, Canada



Abstract 



We consider online learning in partial-monitoring games against an oblivious adversary.
We show that when the number of actions available to the learner is two and the game is
nontrivial then it is reducible to a bandit-like game and thus the minimax regret is
$\Theta(\sqrt{T})$.

1 Introduction 



The partial-monitoring games we consider are defined as follows: Two players, Learner and Nature,
interact with each other in a sequential manner. In each time step Learner can choose one of $N$
actions, while Nature can choose one of $M$ actions. We use the notation
$\underline{n} = \{1, \dots, n\}$ for any integer $n$ and denote the actions of both players by
integers, starting from 1, so the action sets are $\underline{N}$ and $\underline{M}$. At the
beginning of the game both Learner and Nature are given a pair of $N \times M$ matrices,
$G = (L, H)$, where $L$ is the loss matrix and $H$ is the feedback matrix. The elements $\ell_{ij}$
of $L$ are real numbers, and in fact we shall assume that they belong to the $[0, 1]$ interval.
The elements $h_{ij}$ of $H$ could be chosen from any alphabet. However, for the sake of
simplicity, and without loss of generality (w.l.o.g.), we may assume that the elements of $H$ are
also real numbers. Now, still at the beginning of the game, Nature decides on the sequence of
actions $(J_1, J_2, \dots)$ to be played. These actions are kept private, i.e., they are not
revealed to Learner. Nature's actions will also be called outcomes.

The game is played in discrete time steps. At time step $t$ ($t = 1, 2, \dots$), first Learner
chooses an action $I_t$ based on the information available to him up to time $t$. The choice of
the action may be randomized. Upon announcing his action, Learner gets the feedback
$h_{I_t, J_t}$ and suffers the loss $\ell_{I_t, J_t}$. The cycle is then repeated for time step
$t + 1$. It is important to note that the loss suffered is not revealed to Learner.

The goal of Learner is to keep his cumulative loss

$$\hat{L}_T = \sum_{t=1}^{T} \ell_{I_t, J_t}$$

small, where $T$ denotes the time horizon. Learner's performance is evaluated by comparing his
cumulative loss to the cumulative loss of the best fixed action from $\underline{N}$,

$$L_T^* = \min_{i \in \underline{N}} \sum_{t=1}^{T} \ell_{i, J_t},$$

giving rise to the cumulative expected regret (or simply regret),

$$R_T(A, G) = E[\hat{L}_T - L_T^*],$$

where $A$ is the strategy Learner follows. Note that in the definition of $L_T^*$, the best fixed
action is selected in hindsight. When the growth rate of the regret is sublinear in $T$, i.e., the
average regret $R_T/T$ converges to zero, then in the long run Learner can be said to perform as
well as an oracle who can play this best action in hindsight.
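The definitions above can be illustrated with a minimal Python sketch. It computes the realized regret of one fixed action sequence against one outcome sequence; the regret $R_T(A, G)$ of the text is the expectation of this quantity over Learner's randomization. The function names, the 0-based indexing, and the example matrix are assumptions made for illustration only.

```python
def cumulative_loss(L, actions, outcomes):
    """Sum of L[I_t][J_t] over all rounds (0-based action/outcome indices)."""
    return sum(L[i][j] for i, j in zip(actions, outcomes))


def best_fixed_loss(L, outcomes):
    """Cumulative loss of the best single action chosen in hindsight."""
    return min(sum(row[j] for j in outcomes) for row in L)


def regret(L, actions, outcomes):
    """Realized regret of the action sequence on this outcome sequence."""
    return cumulative_loss(L, actions, outcomes) - best_fixed_loss(L, outcomes)


# Hypothetical 2 x 2 loss matrix with entries in [0, 1].
L = [[0.0, 1.0],
     [1.0, 0.0]]
outcomes = [0, 0, 1, 0]   # Nature's (hidden) choices J_1..J_4
actions = [1, 0, 0, 0]    # Learner's choices I_1..I_4

print(regret(L, actions, outcomes))  # 1.0
```

Here Learner suffers a total loss of 2, while always playing the first action would have cost 1, so the realized regret is 1.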

The problem just described is of major importance in learning theory since it models a number
of interesting scenarios including apple tasting [Bartók et al., 2010], a variant of label
efficient prediction, and dynamic pricing [Cesa-Bianchi and Lugosi, 2006]. For further discussion
and examples see Chapters 2-7 in the book by Cesa-Bianchi and Lugosi [2006].



Given a game $G = (L, H)$, our goal is to find out the growth rate of the minimax regret
associated with $G$, and to design strategies that allow Learner to achieve this minimal growth
rate. Let the worst-case regret of algorithm $A$ when used in $G$ for time horizon $T$ be

$$\bar{R}_T(A, G) = \sup_{(j_1, \dots, j_T) \in \underline{M}^T} R_T(A, G),$$

where the supremum is taken over all outcome sequences. Formally, the minimax regret of game $G$
for time horizon $T$ is defined by

$$R_T^*(G) = \inf_A \bar{R}_T(A, G) = \inf_A \sup_{(j_1, \dots, j_T) \in \underline{M}^T} R_T(A, G),$$

where the infimum is taken over all strategies of Learner. Note that, since $R_T(A, G) \ge 0$ for
constant outcome sequences, also $\bar{R}_T(A, G) \ge 0$ and $R_T^*(G) \ge 0$.

Definition 1 A game is called trivial if the minimax regret is either zero or scales linearly
with the number of time steps.

Lemma 1 The following three statements are equivalent:

a) The minimax regret is zero for each $T$.

b) The minimax regret is zero for some $T$.

c) There exists an action $i \in \underline{N}$ whose loss is not larger than the loss of any
other action, irrespective of the choice of Nature's action.

The proof is in the Appendix. 
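Condition c) of Lemma 1 is a finite check on the loss matrix. The following Python sketch, with a hypothetical helper name and example matrices that are not taken from the paper, lists the actions that are never worse than any other action for every outcome.

```python
def dominating_actions(L):
    """Return the (0-based) actions whose loss is <= every other action's
    loss for each of Nature's outcomes, i.e. condition c) of Lemma 1."""
    n_actions = len(L)
    n_outcomes = len(L[0])
    return [
        i for i in range(n_actions)
        if all(L[i][j] <= L[k][j]
               for k in range(n_actions)
               for j in range(n_outcomes))
    ]


# Action 0 is never worse than action 1: zero minimax regret.
print(dominating_actions([[0.2, 0.5], [0.3, 0.5]]))  # [0]
# The order of the two actions flips with the outcome: no dominating action.
print(dominating_actions([[0.0, 1.0], [1.0, 0.0]]))  # []
```

A non-empty result means the minimax regret is zero, so the game is trivial; an empty result only rules out this one kind of triviality, since the regret may still scale linearly.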
2 Previous work 

The growth rate of the minimax regret is strongly influenced by the choice of $L$ and $H$.
Consider, for example, the case of so-called full-information games, where the feedback is
sufficient for Learner to recover Nature's action in each round. In the simplest case, this is
represented by $h_{ij} = j$. However, from the point of view of the information content of the
feedback, we get an equivalent situation when each row of $H$ is composed of pairwise distinct
elements. The following result is known to hold for these games:

Theorem 2 Consider a full-information game $G$ where Learner has $N$ actions. Then there exists
an algorithm $A$ such that for any time horizon $T$, $R_T(A, G) \le \sqrt{(T/2) \ln N}$.

Algorithm $A$ in the theorem above can be the Exponentially Weighted Average Forecaster with
appropriate tuning (see, e.g., Cesa-Bianchi and Lugosi [2006, Corollary 4.2]).

Another special case is when the only information that Learner receives is the loss of the
action taken (i.e., when $H = L$), which we call the bandit case, following Cesa-Bianchi and
Lugosi [2006]. Then, the INF algorithm due to Audibert and Bubeck [2009] is known to achieve a
constant multiple of the minimax regret:

Theorem 3 Take a bandit game $G$ where Learner has $N$ actions. Then there exists an algorithm
$A$ such that $R_T(A, G) \le 15\sqrt{NT}$. Further, for any $N$ there exists a game $G$ such that
for any time horizon $T$, $R_T^*(G) \ge \frac{1}{20}\sqrt{NT}$.



The lower bound on the minimax regret is due to Auer et al. [2002] (see also Cesa-Bianchi and
Lugosi [2006, Theorem 6.11]), while the upper bound is due to Audibert and Bubeck [2009]. (The
Exp3 algorithm due to Auer et al. [2002] achieves the same upper bound up to logarithmic
factors.) The following theorem, due to Antos et al. [2011], is a lower bound for any
non-trivial game.



Theorem 4 If $G$ is a non-trivial partial-monitoring game then there exists a constant $c > 0$
such that for any $T$, $R_T^*(G) \ge c\sqrt{T}$.

Now, consider the game $G = (L, H)$ with

That is, the first action of Learner gives full information about the outcome, but it has a high
cost, while the other two actions do not reveal any information. Further, the ordering of actions
2 and 3 by costs is reversed based on the choice of Nature. Then, the following holds
[Cesa-Bianchi et al., 2006, Theorem 5.1]:

Theorem 5 The above game has $R_T^*(G) = \Omega(T^{2/3})$.

This shows that the above game is intrinsically harder than a bandit problem. Further, the
algorithm FeedExp3 by Piccolboni and Schindelhauer [2001] is known to achieve this growth rate
[Cesa-Bianchi et al., 2006]:



Theorem 6 Consider any partial-monitoring game $G = (L, H)$ such that $L = KH$ for some matrix
$K$. Then, there exists an algorithm $A$ such that $R_T(A, G) \le C T^{2/3}$, where $C$ depends on
$N$ and $k^* = \max_{i,j} |k_{ij}|$.

Thus, we see that the difficulty of a game depends on the structure of $L$ and $H$. Recently,
Bartók et al. [2010] classified almost all games by their difficulty when the number of actions
available to Nature is limited to $M = 2$. In effect, they showed that the exponent in the
dependence of the minimax regret on $T$ in these games is one of $\{0, 1/2, 2/3, 1\}$.

In this short communication, we investigate the dual case when the number of actions available
to Nature is not restricted, but the number of actions available to Learner is limited to
$N = 2$.

3 Result 

In this section we state and prove that, in essence, any non-trivial two-action game can be viewed 
as a bandit game. 

We need some preparations. First, we will make use of the following concept: 

Definition 2 Take two games, $G = (L, H)$ and $G' = (L', H')$, where $L$, $L'$, $H$, and $H'$ are
$N \times M$ matrices. We say that $G'$ is simulation-and-regret-not-harder than $G$ (or easier
for short, denoted by $G' \le G$) when the following holds: Fix any algorithm $A$. Then, one can
find an algorithm $A'$ such that the behavior of $A$ on $G$ can be replicated by using $A'$ on
$G'$ in the sense that for the same outcome sequences, the two algorithms will choose the same
action sequences and the regret in the second case is at most the regret in the first case, that
is, $R_T(A', G') \le R_T(A, G)$.

We say that $G$ and $G'$ are simulation-and-regret-equivalent (or equivalent, $G' \sim G$) when
both $G' \le G$ and $G \le G'$.

Clearly, $\le$ is a preorder and $\sim$ is an equivalence relation on the set of $N \times M$
games. Moreover, if $G' \le G$ then $R_T^*(G') \le R_T^*(G)$, and if $G \sim G'$ then their
minimax regret is the same. We need a few simple lemmata on these relations of games:

Lemma 7 The regret of a sequence of actions in a game does not change if the loss matrix is
changed by subtracting the same real number from each coordinate of one of its columns (see,
e.g., Piccolboni and Schindelhauer [2001]). Therefore, letting
$\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^N$, $v \in \mathbb{R}^M$, and
$G' = (L - \mathbf{1} v^\top, H)$, we have that $G \sim G'$.
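The invariance stated in Lemma 7 can be checked by brute force on a small example. The Python sketch below uses a hypothetical $2 \times 3$ loss matrix and helper names chosen for illustration; it shifts the columns by the first row of $L$ and compares the realized regret of every action sequence on every outcome sequence of length 3.

```python
from itertools import product


def regret(L, actions, outcomes):
    """Realized regret of an action sequence against the best fixed action."""
    learner = sum(L[i][j] for i, j in zip(actions, outcomes))
    best_fixed = min(sum(row[j] for j in outcomes) for row in L)
    return learner - best_fixed


def shift_columns(L, v):
    """Return L - 1 v^T: subtract v[j] from every entry of column j."""
    return [[entry - v[j] for j, entry in enumerate(row)] for row in L]


# Hypothetical loss matrix; v is its first row, so the shifted first row is zero.
L = [[0.5, 0.25, 1.0],
     [0.0, 0.75, 0.5]]
v = L[0]
L_shifted = shift_columns(L, v)

T = 3
max_gap = 0.0
for actions in product(range(2), repeat=T):
    for outcomes in product(range(3), repeat=T):
        gap = abs(regret(L, actions, outcomes) - regret(L_shifted, actions, outcomes))
        max_gap = max(max_gap, gap)

print(L_shifted[0])  # [0.0, 0.0, 0.0]
print(max_gap)
```

The shift changes Learner's loss and the loss of every fixed action by the same amount in each round, which is why the two regrets agree on every sequence.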

Lemma 8 If $G = (L, H)$ and $G' = (L, H')$ differ only in their feedback matrices and $H'$ can be
obtained by $h'_{ij} = f_i(h_{ij})$ with the help of some mappings $f_i$ ($i \in \underline{N}$),
then $G \le G'$. If each $f_i$ is injective then $G \sim G'$.

In what follows, a transformation of some game into another game that takes either the first or 
the second form just defined shall be called an admissible transformation. 

The following proposition shows that if a 2-armed partial-monitoring game is non-trivial then
there is no loss in generality by assuming that $L = KH$ for some
$K \in \mathbb{R}^{2 \times 2}$. This statement for arbitrary $N$ and most of the ideas for its
proof could be extracted from the paper of Piccolboni and Schindelhauer [2001]. An exact detailed
proof for $N = 2$ is included here for the sake of completeness.

Proposition 1 Let $G_0 = (L_0, H_0)$ be a non-trivial 2-armed partial-monitoring game. Then, there
exist matrices $L, H \in \mathbb{R}^{2 \times M}$ such that $G_0 \le G = (L, H)$ and $L = KH$ for
some $K \in \mathbb{R}^{2 \times 2}$.

Proof of Proposition 1: First, we transform $L_0$ to $L$ using Lemma 7 with $v^\top$ being its
first row. Thus, the first row of $L$ becomes identically zero, and we get a non-trivial game
$G_1 = (L, H_0) \sim G_0$. Let $\ell$ denote the transpose of the second row of $L$. In what
follows we construct the matrix $H$ using an admissible transformation of $H_0$ defined in
Lemma 8.

We construct matrix $A$ in the following way. Assume that there are $m_1$ ($m_2$) distinct
entries in the first (respectively, second) row of $H_0$, and transform $H_0$ by two injective
mappings (Lemma 8) such that the elements of its $i$-th row ($i \in \underline{2}$) are from
$\underline{m_i}$. We define the matrices $A_i \in \mathbb{R}^{m_i \times M}$ as follows: Let each
row of $A_i$ be the "indicator" row of the corresponding value of the $i$-th row of $H_0$, that
is, $[A_i]_{jk} \stackrel{\mathrm{def}}{=} \mathbb{I}\{[H_0]_{ik} = j\}$. Define $A$ by stacking
these matrices on top of each other:

$$A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}.$$

$$H_0 = \begin{pmatrix} 1 & 2 & 3 & 1 \\ 1 & 2 & 2 & 2 \end{pmatrix}, \qquad
A = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$

Figure 1: An example for the construction of matrix $A$ used in the proof of Proposition 1. The
first three rows of $A$ are constructed from the first row of $H_0$, which has three distinct
elements; the remaining two rows are constructed from the second row of $H_0$. For more details,
see the text.

See Figure 1 for an example.
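The stacking construction of $A$ can be sketched in a few lines of Python. The helper name is hypothetical, and the sketch orders the distinct symbols of each row by sorting them, which stands in for the injective relabeling to $1, \dots, m_i$; the input is a $2 \times 4$ feedback matrix with rows $(1, 2, 3, 1)$ and $(1, 2, 2, 2)$.

```python
def indicator_matrix(H0):
    """Stack, for each row of H0, one 0/1 indicator row per distinct symbol.

    Row (i, value) has a 1 in column k exactly when H0[i][k] == value.
    Symbols are assumed to be sortable (e.g. already relabeled as integers).
    """
    A = []
    for row in H0:
        for value in sorted(set(row)):
            A.append([1 if entry == value else 0 for entry in row])
    return A


H0 = [[1, 2, 3, 1],
      [1, 2, 2, 2]]
A = indicator_matrix(H0)
for row in A:
    print(row)
# [1, 0, 0, 1]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
# [1, 0, 0, 0]
# [0, 1, 1, 1]
```

Each block of rows sums to the all-ones row, a fact that the proof of Lemma 9 in the Appendix relies on.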

The following lemma, proven in the Appendix, is key to proving Proposition 1.

Lemma 9 If $\ell \notin \operatorname{Im} A^\top$ then $G_1$ is trivial.

Using the assumption that $G_1$ is non-trivial, we have from Lemma 9 that
$\ell \in \operatorname{Im} A^\top$ must hold. That is, $\ell$ can be written as a linear
combination of the rows of $A$:

$$\ell = \sum_{i=1}^{m} \lambda_i a_i,$$

where $m = m_1 + m_2$ and the vectors $a_i^\top$ are the rows of $A$. Let

$$h_1 = \sum_{i=1}^{m_1} \lambda_i a_i \qquad \text{and} \qquad h_2 = \sum_{i=m_1+1}^{m} \lambda_i a_i.$$

Finally, let

$$H = \begin{pmatrix} h_1^\top \\ h_2^\top \end{pmatrix}$$

and $G = (L, H)$. Now if the $k$-th and $k'$-th entries of the first row of $H_0$ are identical
then $[a_i]_k = [a_i]_{k'}$ for $1 \le i \le m_1$, hence also $[h_1]_k = [h_1]_{k'}$. The same
holds for the second row of $H_0$ and $h_2$. Thus $H$ can be obtained by appropriate mappings from
$H_0$, and Lemma 8 implies $G_1 \le G$. On the other hand, setting

$$K = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix} \qquad (2)$$

we have that $L = KH$. ■
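The last step of the proof can be traced numerically. The Python sketch below assumes that the coefficients $\lambda_i$ are already known (here they are hypothetical values, and $\ell$ is simply defined from them), uses the $5 \times 4$ indicator matrix built from feedback rows $(1, 2, 3, 1)$ and $(1, 2, 2, 2)$, and checks that $L = KH$ with $K$ as in (2).

```python
def combine(A, lam, start, stop):
    """Return sum over i in [start, stop) of lam[i] * (row i of A)."""
    n_cols = len(A[0])
    return [sum(lam[i] * A[i][k] for i in range(start, stop)) for k in range(n_cols)]


def matmul(X, Y):
    """Plain-list matrix product X @ Y."""
    return [[sum(X[i][r] * Y[r][j] for r in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


A = [[1, 0, 0, 1],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 1]]
m1 = 3                                   # rows of A coming from action 1
lam = [0.25, -0.5, 0.5, 0.25, 0.25]      # hypothetical coefficients lambda_i

ell = combine(A, lam, 0, len(A))         # second row of L, as a list
h1 = combine(A, lam, 0, m1)              # new feedback row for action 1
h2 = combine(A, lam, m1, len(A))         # new feedback row for action 2

L = [[0.0] * len(ell), ell]              # first row is zero after Lemma 7
H = [h1, h2]
K = [[0, 0], [1, 1]]

print(H)                 # [[0.25, -0.5, 0.5, 0.25], [0.25, 0.25, 0.25, 0.25]]
print(matmul(K, H) == L)  # True
```

Outcomes 1 and 4 give the same original feedback under action 1, and accordingly `h1[0] == h1[3]`, so the new feedback is a function of the old one, as Lemma 8 requires.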

The following Proposition is more than what we need, but it is interesting in itself: 

Proposition 2 Let $G = (L, H)$ be a 2-armed partial-monitoring game such that $L = KH$ for some
$K \in \mathbb{R}^{2 \times 2}$. Then, there exists a $2 \times M$ bandit game $G'$ such that
$G \le G'$. If $K$ is given by (2) then $G \sim G'$.



Proof: We will construct a bandit game $G' = (L', H') \ge G$ that satisfies $L' = H'$. Let
$K = [k_{ij}]_{2 \times 2}$ and

$$D = \operatorname{diag}(k_{11} - k_{21},\ k_{22} - k_{12})$$

be a $2 \times 2$ diagonal matrix, and define the feedback matrix of $G'$ by $H' = DH$. Then, both
rows of $H'$ are scalar multiples of the corresponding rows of $H$. Hence, by these mappings and
Lemma 8, $G \le (L, H')$. If $K$ is given by (2) then $D = \operatorname{diag}(-1, 1)$, thus both
mappings are injective and $G \sim (L, H')$. On the other hand, $K - D = \mathbf{1} k^\top$ where
$k^\top = (k_{21}, k_{12})$. Consider the loss matrix

$$L' = L - \mathbf{1}(k^\top H).$$

By Lemma 7, $G' = (L', H') \sim (L, H')$. Moreover,

$$L' = L - (\mathbf{1} k^\top) H = L - (K - D) H = DH = H'.$$

Remark 1 It is worth considering why the above proof works only for $N = 2$. We used that from
any $2 \times 2$ matrix $K$ we can subtract a diagonal matrix resulting in a matrix with identical
rows. For $N \ge 3$, this obviously does not hold (there are not enough "degrees of freedom").
Indeed, for $N \ge 3$, we have regret rates between $\Theta(\sqrt{T})$ and $\Theta(T)$; for
example, Theorems 5 and 6 show that the game in (1) has minimax regret rate $\Theta(T^{2/3})$.
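The algebra in the proof of Proposition 2 can be checked on a concrete instance. The Python sketch below takes $K$ as in (2) and a hypothetical $2 \times 4$ feedback matrix $H$ (the numbers are illustrative, not from the paper), forms $D$, $H' = DH$ and $L' = L - \mathbf{1}(k^\top H)$, and confirms that the resulting game is a bandit game, i.e. $L' = H'$.

```python
def matmul(X, Y):
    """Plain-list matrix product X @ Y."""
    return [[sum(X[i][r] * Y[r][j] for r in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


K = [[0.0, 0.0],
     [1.0, 1.0]]
H = [[0.25, -0.5, 0.5, 0.25],      # hypothetical feedback rows
     [0.25, 0.25, 0.25, 0.25]]
L = matmul(K, H)                   # the assumption L = K H

# D = diag(k11 - k21, k22 - k12); here diag(-1, 1).
D = [[K[0][0] - K[1][0], 0.0],
     [0.0, K[1][1] - K[0][1]]]
H_prime = matmul(D, H)

# K - D = 1 k^T with k^T = (k21, k12); subtract k^T H from every row of L.
k = [K[1][0], K[0][1]]
kH = [k[0] * H[0][j] + k[1] * H[1][j] for j in range(len(H[0]))]
L_prime = [[L[i][j] - kH[j] for j in range(len(kH))] for i in range(2)]

print(H_prime)             # [[-0.25, 0.5, -0.5, -0.25], [0.25, 0.25, 0.25, 0.25]]
print(L_prime == H_prime)  # True
```

Subtracting `kH` is a column shift of the kind covered by Lemma 7, and multiplying the rows of `H` by the nonzero diagonal entries of `D` is an injective relabeling of the feedback in the sense of Lemma 8.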

Now, we are ready to prove our main result. 

Theorem 10 Each non-trivial 2-armed partial-monitoring game is easier than an appropriate
$2 \times M$ bandit game. Consequently, its minimax regret is $\Theta(\sqrt{T})$, where $T$ is the
number of time steps.

Proof: According to Propositions 1 and 2, if $G_0$ is non-trivial then we can construct first
$G = (L, H)$ such that $L = KH$ and $G_0 \le G$, then a $2 \times M$ bandit game $G'$ such that
$G \le G'$. Thus $G_0 \le G'$, which implies $R_T^*(G_0) \le R_T^*(G') = O(\sqrt{T})$ by
Theorem 3. Together with the lower bound of Theorem 4, this finishes the proof. ■



Appendix 

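Proof of Lemma 1: a)$\Rightarrow$b) is obvious.

b)$\Rightarrow$c) For any $A$,

$$\begin{aligned}
\sup_{(j_1, \dots, j_T) \in \underline{M}^T} R_T(A, G)
&\ge \sup_{j \in \underline{M},\ j_1 = \dots = j_T = j} E[\hat{L}_T - L_T^*] \\
&= \sup_{j \in \underline{M}} E\Big[\sum_{t=1}^{T} \big(\ell_{I_t, j} - \min_{i \in \underline{N}} \ell_{ij}\big)\Big] \\
&\ge \sup_{j \in \underline{M}} \Big(E[\ell_{I_1, j}] - \min_{i \in \underline{N}} \ell_{ij}\Big) \stackrel{\mathrm{def}}{=} f(A).
\end{aligned}$$

b) leads to

$$0 = R_T^*(G) = \inf_A \sup_{(j_1, \dots, j_T) \in \underline{M}^T} R_T(A, G) \ge \inf_A f(A).$$

Observe that $f(A)$ depends on $A$ only through the distribution of $I_1$ on $\underline{N}$,
denoted by $q = q(A)$ now, that is, $f(A) = f'(q)$. This dependence is continuous on the compact
domain of $q$, hence the infimum can be replaced by a minimum. Thus $\min_q f'(q) \le 0$, that is,
there is a $q$ such that for all $j \in \underline{M}$,
$E[\ell_{I_1, j}] = \min_{i \in \underline{N}} \ell_{ij}$. This implies that the support of $q$
contains only actions whose loss is not larger than the loss of any other action, irrespective of
the choice of Nature's action.

c)$\Rightarrow$a) The algorithm that always plays $i$ has zero regret for all outcome sequences
and $T$. ■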

Proof of Lemma 9: $\ell \notin \operatorname{Im} A^\top$ implies
$\{\ell\} \not\subseteq \operatorname{Im} A^\top$, which is equivalent to
$\{\ell\}^\perp \not\supseteq \operatorname{Ker} A$, as can be seen by taking the orthogonal
complement of both sides and using $(\operatorname{Ker} A)^\perp = \operatorname{Im} A^\top$. The
latter implies that there exists $v$ such that $v \in \operatorname{Ker} A$ but
$\ell^\top v \ne 0$. We may assume w.l.o.g. that $\ell^\top v > 0$ (otherwise take $-v$). Note
that, since the first $m_1$ rows of $A$ add up to $\mathbf{1}^\top$ and
$v \in \operatorname{Ker} A$, the coordinates of $v$ sum to zero.

Let $\Delta_M \subset \mathbb{R}^M$ denote the $M$-dimensional probability simplex. If
$p \in \Delta_M$ is a distribution over Nature's actions $\underline{M}$, then it is easy to see
that the first $m_1$ coordinates of $Ap$ give the probability distribution of observing the
different values of the first row of $H_0$ while Learner chooses action 1, assuming Nature chooses
her actions from $p$. The same applies to the last $m_2$ coordinates of $Ap$ and action 2. It
follows that if $Ap_1 = Ap_2$ for two distributions then no algorithm can distinguish them. We
find such $p_1$, $p_2$ and apply this idea as follows:

If for all $p \in \Delta_M$, $\ell^\top p \ge 0$ (or $\ell^\top p \le 0$), then $G_1$ has zero
minimax regret and thus it is trivial. Otherwise, there exist $p_+$ and $p_-$ with
$\ell^\top p_+ > 0$ and $\ell^\top p_- < 0$. Now either there exists
$p_0 \in \operatorname{Int}(\Delta_M)$ such that $\ell^\top p_0 = 0$, or we can assume w.l.o.g.
that one of $p_+$ and $p_-$ is in $\operatorname{Int}(\Delta_M)$, in which case there must be
again a $p_0 \in \operatorname{Int}(\Delta_M)$ on the segment $\overline{p_+ p_-}$ such that
$\ell^\top p_0 = 0$ by the continuity of $\ell^\top p$ in $p$. In other words, we have a
distribution $p_0$ over $\underline{M}$ such that $p_0$ is not on the boundary of the probability
simplex and the expected losses of the two actions are equal.

Now let $p_1 = p_0 + \varepsilon v$ and $p_2 = p_0 - \varepsilon v$ for some $\varepsilon > 0$. If
$\varepsilon$ is small enough then both $p_1$ and $p_2$ are on the probability simplex
$\Delta_M$. Since $Av = 0$ we have that $Ap_1 = Ap_2$.

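Given a $p \in \Delta_M$, we use randomization such that $(J_1, \dots, J_T)$ is replaced by a
vector $(J_1, \dots, J_T) \in \underline{M}^T$ of i.i.d. random variables distributed according to
$p$, independent of the randomization in the algorithm. Let $A$ be an arbitrary strategy of
Learner. For $k \in \underline{2}$, given that the outcome distribution is $p_k$, let
$P_k[\cdot]$ be the probability of an event and $E_k[\cdot]$ be the expectation of a random
variable. Then the worst-case regret of $A$ is

$$\begin{aligned}
\sup_{(J_1, \dots, J_T) \in \underline{M}^T} R_T(A, G_1)
&\ge E_k[R_T(A, G_1)] \\
&= E_k\Big[\sum_{t=1}^{T} \ell_{I_t, J_t} - \min_{i \in \underline{2}} \sum_{t=1}^{T} \ell_{i, J_t}\Big] \\
&= E_k\Big[\sum_{t=1}^{T} \mathbb{I}\{I_t = 2\}\, \ell_{J_t} - \min\Big(\sum_{t=1}^{T} \ell_{J_t},\ 0\Big)\Big]
\qquad (\ell_{1j} = 0,\ \ell_{2j} = \ell_j) \\
&\ge \sum_{t=1}^{T} E_k[\mathbb{I}\{I_t = 2\}]\, E_k[\ell_{J_t}] - \min\Big(\sum_{t=1}^{T} E_k[\ell_{J_t}],\ 0\Big) \\
&= \sum_{t=1}^{T} P_k[I_t = 2]\, \ell^\top p_k - \min(T \ell^\top p_k,\ 0) \\
&= \ell^\top p_k\, \mu_T^k + T(-\ell^\top p_k)_+,
\end{aligned}$$

where the inequality in the fourth line holds by the independence of $I_t$ and $J_t$ and Jensen's
inequality for $\min$, and where

$$\mu_T^k = \sum_{t=1}^{T} P_k[I_t = 2]$$

is the expected number of times $A$ chooses action 2 under $p_k$ up to time $T$. Observe that
$Ap_1 = Ap_2$ means that for both actions, the feedback distribution is the same under outcome
distributions $p_1$ and $p_2$, implying (by induction) that for each $t \ge 1$,
$P_1[I_t = 2] = P_2[I_t = 2]$. This leads to
$\mu_T^1 = \mu_T^2 \stackrel{\mathrm{def}}{=} \mu_T = \mu_T(A)$. Moreover, using
$\ell^\top p_0 = 0$ and $\ell^\top v > 0$,

$$\ell^\top p_k\, \mu_T + T(-\ell^\top p_k)_+ =
\begin{cases}
\varepsilon\, \ell^\top v\, \mu_T & \text{if } k = 1, \\
\varepsilon\, \ell^\top v\, (T - \mu_T) & \text{if } k = 2.
\end{cases}$$

Thus we have

$$\begin{aligned}
R_T^*(G_1) = \inf_A \sup_{(J_1, \dots, J_T) \in \underline{M}^T} R_T(A, G_1)
&\ge \inf_A \max_{k \in \underline{2}} \big(\ell^\top p_k\, \mu_T + T(-\ell^\top p_k)_+\big) \\
&= \varepsilon\, \ell^\top v\, \inf_A \max(\mu_T,\ T - \mu_T) \ge \varepsilon\, \ell^\top v\, T/2,
\end{aligned}$$

that is, the minimax regret of $G_1$ grows linearly in $T$ and $G_1$ is trivial.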